perf: cap planner budget when model dwarfs the streaming budget by fszontagh · Pull Request #1612 · leejet/stable-diffusion.cpp

fszontagh · 2026-06-06T16:11:37Z

Summary

When the model is much bigger than --max-vram, the planner currently merges the base segments into 1-2 huge merged segments. The follow-up worst_merged_segment_footprint reservation in compute_streaming_segments then leaves almost nothing for chunk-K residency, so on a model like Z-Image bf16 (11.7 GB on a 12 GB GPU) chunk-K stays near 0 and every sampling step re-uploads the full model.

When the model fits comfortably in the budget, the current behaviour is already optimal: one big merged segment, chunk-K covers the whole model, and there is no per-step host-to-device (H2D) upload.

This PR caps the budget passed to resolve_plan at a quarter of the streaming budget when total_params_bytes > 0.75 * effective_budget, otherwise it passes the full budget. Smaller merged segments shrink the worst_merged_segment_footprint reservation downstream, which frees enough residency budget for a meaningful chunk-K. Small/quantized models are unaffected.
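The selection rule can be sketched as below. This is a minimal illustration, not the code from the PR: the helper name `planner_budget_for` and the `MB` constant are hypothetical, and only `total_params_bytes`, `effective_budget`, the 0.75 threshold, and the one-quarter cap come from the description above.

```cpp
#include <cstdint>

// Hypothetical helper (not taken from the PR): choose the budget that would be
// handed to the planner, given the summed parameter size and the streaming budget.
constexpr uint64_t planner_budget_for(uint64_t total_params_bytes, uint64_t effective_budget) {
    // Large-model test from the description: total_params_bytes > 0.75 * effective_budget.
    return static_cast<double>(total_params_bytes) > 0.75 * static_cast<double>(effective_budget)
               ? effective_budget / 4  // cap: a quarter of the streaming budget
               : effective_budget;     // model fits comfortably: keep the full budget
}

// Illustrative unit only; the real budget is whatever --max-vram resolves to.
constexpr uint64_t MB = 1000ull * 1000ull;
```

With illustrative numbers, `planner_budget_for(11700 * MB, 12000 * MB)` yields `3000 * MB` (the capped path), while `planner_budget_for(3000 * MB, 12000 * MB)` yields the full `12000 * MB`.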

Quantization-aware: total_params_bytes is summed via ggml_nbytes, so a Q8/Q4 model is correctly identified as small relative to the budget.
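To make the quantization point concrete, here is a rough sketch of how byte size depends on tensor type. It does not call ggml: `TypeTraits` and `approx_nbytes` are hypothetical stand-ins that approximate what `ggml_nbytes` reports for a contiguous tensor whose element count is a multiple of the block size. The block layouts used are ggml's usual ones (bf16: 2 bytes per element; Q8_0: 34 bytes per 32 elements; Q4_0: 18 bytes per 32 elements).

```cpp
#include <cstdint>

// Hypothetical stand-in for ggml's per-type traits:
// how many elements one block holds and how many bytes that block occupies.
struct TypeTraits {
    uint64_t block_elems;
    uint64_t block_bytes;
};

constexpr TypeTraits kBf16{1, 2};    // 2 bytes per element
constexpr TypeTraits kQ8_0{32, 34};  // fp16 scale + 32 x int8
constexpr TypeTraits kQ4_0{32, 18};  // fp16 scale + 32 x 4-bit

// Simplified size computation. Assumption: contiguous tensor, element count
// divisible by block_elems (the real ggml_nbytes also handles strides).
constexpr uint64_t approx_nbytes(uint64_t n_elements, TypeTraits t) {
    return n_elements / t.block_elems * t.block_bytes;
}
```

For the same element count, Q8_0 comes to about 53% and Q4_0 to about 28% of the bf16 size, which is why a quantized checkpoint can fall under the 0.75 threshold when its bf16 counterpart does not.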

Related Issue / Discussion

Continuation of the streaming-budget series #1576, #1598, #1601, #1611.

Additional Information

RTX 3060 12 GB, --offload-to-cpu --stream-layers --max-vram -1:

| Workload | Before | After |
| --- | --- | --- |
| SDXL bf16 1152x896 batch=2 8 steps | 21.6 s | 20.9 s |
| Z-Image bf16 1024x688 batch=2 9 steps | 138 s | 98 s |

SDXL stays in the full-budget path so the plan is unchanged (1 merged segment, whole UNet resident). Z-Image takes the T/4 path, the planner produces 9-11 merged segments instead of 2, and chunk-K grows from ~0 to ~3.5 GB, dropping per-step H2D enough for a ~30% wall-clock win.
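For reference, the relative changes implied by the reported timings are plain arithmetic:

```math
\text{Z-Image: } \frac{138 - 98}{138} \approx 0.29, \qquad \frac{138}{98} \approx 1.41\times
\qquad\qquad
\text{SDXL: } \frac{21.6 - 20.9}{21.6} \approx 0.03
```

So the Z-Image run is roughly 29% shorter (about 1.4x faster), and the SDXL run differs by about 3%.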

We tested T/2, T/3, T/5, T/6, T/8 on the Z-Image workload; T/4 is the empirical optimum on the test hardware (smaller fractions trade chunk-K headroom for per-dispatch overhead and regress).

Checklist

I have read and confirmed this PR follows the contribution guidelines.

leejet · 2026-06-07T14:34:44Z

The heuristic makes sense for --stream-layers, but this currently applies before the stream_layers_enabled branch, so large models can get effective_budget / 4 planner merges even when streaming is disabled. That changes non-streaming behavior and may increase dispatch/segment overhead without providing chunk-K residency benefits. Can we gate the cap on stream_layers_enabled?

fszontagh · 2026-06-07T20:29:55Z

Good catch - gated on stream_layers_enabled in 88a5ee4. Non-streaming path now keeps the full effective_budget.
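A minimal sketch of the gated rule, assuming the cap itself is otherwise unchanged. The helper name `gated_planner_budget` is hypothetical and this is not the code from 88a5ee4; only `stream_layers_enabled`, `total_params_bytes`, `effective_budget`, the 0.75 threshold, and the quarter cap come from this thread.

```cpp
#include <cstdint>

// Hypothetical helper (not the actual change in 88a5ee4): apply the quarter cap
// only when layer streaming is enabled and the model is large relative to the budget.
constexpr uint64_t gated_planner_budget(bool stream_layers_enabled,
                                        uint64_t total_params_bytes,
                                        uint64_t effective_budget) {
    return (stream_layers_enabled &&
            static_cast<double>(total_params_bytes) > 0.75 * static_cast<double>(effective_budget))
               ? effective_budget / 4  // streaming + large model: capped planner budget
               : effective_budget;     // non-streaming or small model: full budget
}
```

With streaming disabled, the function returns `effective_budget` regardless of model size, which matches the behaviour requested in the review.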

fszontagh added 3 commits June 6, 2026 17:16

perf: cap planner budget when model dwarfs the streaming budget

2c830f5

Merge remote-tracking branch 'upstream/master' into perf/smaller-merged-segments

f429272

Merge remote-tracking branch 'upstream/master' into perf/smaller-merged-segments

33a84ba

perf: gate planner budget cap on stream_layers_enabled

88a5ee4

Merge remote-tracking branch 'upstream/master' into perf/smaller-merged-segments

1f8c609

leejet merged commit 17a2b4a into leejet:master Jun 8, 2026
15 checks passed

leejet merged 5 commits into leejet:master from fszontagh:perf/smaller-merged-segments
